library(tidyverse)
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')

Introduction

Regression

model_1 = lm(award_total_amount ~ libuser, 
             data = model_variables)

coef(model_1)
summary(model_1)

Note there are only two fitted values, one for library users and one for the non-users.

fitted(model_1)[1:10]  # first 10 predictions on current data set

Use modelr to add them to your data set.

library(modelr) # need to install?
model_variables %>% 
  add_predictions(model_1, var = 'Prediction') %>% 
  group_by(libuser) %>% 
  summarise(result = mean(Prediction))

The broom library also has functionality for many common models.

library(broom) # need to install?
augment(model_1)

Use ggeffects for a quick visual!

library(ggeffects) # need to install?
ggpredict(model_1) %>% 
  plot()

We can assess performance: - within a model (e.g. Adjusted \(R^2\)) - by comparing models (e.g. anova test) - with prediction on new data (e.g. RMSE)

model_2 = lm(award_total_amount ~ libuser + gender, 
             data = model_variables)

anova(model_1, model_2)

Other approaches (all must be done on same data!):

adjr1 = summary(model_1)$adj
adjr2 = summary(model_2)$adj

round(c(adjr1, adjr2), 2)

AIC is uninterpretable on its own, but reflects the likelihood of the data given the model and complexity. When you have multiple AIC values, the lower value is the better model.

AIC(model_1, model_2)

Classification

model_logreg = glm(award_total_amount > 5e6 ~ age, 
                   data = model_variables,
                   family = binomial)

summary(model_logreg)

The caret package has a nice function to summarize classification performance via a variety of metrics.

It needs labeled factors to use though. Often you would have this anyway.

library(caret) # need to install?

predictions =  predict(model_logreg) > 0  # same as probability > .5
predictions = factor(predictions, labels = c('low', 'high'))

target = model_variables$award_total_amount > 5e6
target = factor(target, labels = c('low', 'high'))

confusionMatrix(predictions, target)

Python examples

While Python has statistical modeling capabilities, it’s nowhere near the level of R. If you just want to stay in the Python world, you can do basic models without issue. However, it suffers from consistency of method, documentation, stability across updates etc. You might be better off simply calling R with something like rpy2 or a similar approach.

Init

# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
import statsmodels


model_variables = pd.read_csv('data/model_variables_anonymized.csv')

Statsmodels

Statsmodels is Python’s attempt to do statistical modeling. It essentially follows R’s formula style.

import statsmodels.api as sm
import statsmodels.formula.api as smf

# specify the model, then fit
model_1 = smf.ols('award_total_amount ~ libuser', data=model_variables)
results_1 = model_1.fit()

# Inspect the results
print(results_1.summary())
model_2 = smf.ols('award_total_amount ~ libuser + gender', data=model_variables)
results_2 = model_2.fit()

# Inspect the results
print(results_2.summary())
print(sm.stats.anova_lm(results_1, results_2))

Classification

model_variables['award_high'] = 1*(model_variables['award_total_amount'] > 5e6) # 1* makes it a numeric

logit_mod = smf.logit('award_high ~ libuser + gender', data = model_variables)
logit_result = logit_mod.fit()

print(logit_result.summary())
logit_result.pred_table() / model_variables.shape[0]